Objectives

  • Identify the basic workflow for conducting text analysis
  • Descriptive text visualization
    • Wordclouds
    • N-gram viewers
  • Geospatial visualization
  • Network analysis
  • Even more complex methods
    • Sentiment analysis
    • Latent semantic analysis
library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(tidytext)
library(twitteR)
library(wordcloud)
library(scales)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Basic workflow for text analysis

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Obtain your text sources

Text data can come from lots of areas:

  • Web sites
    • Twitter
  • Databases
  • PDF documents
  • Digital scans of printed materials

The easier it is to convert your text data into digitally stored text, the cleaner your results will be and the fewer transcription errors you will introduce.

Extract documents and move into a corpus

A text corpus is a large and structured set of texts. It typically stores each text as a raw character string, with metadata and details stored alongside the text.
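In the tidy text framework, a corpus can be as simple as a data frame with one row per document: the raw character string in one column and the metadata alongside it. A minimal sketch (the documents and metadata here are invented for illustration):

```r
library(tibble)

# a minimal "corpus": one row per document, raw text plus metadata
corpus <- tibble(
  doc_id = c("austen_pp", "dickens_totc"),
  author = c("Jane Austen", "Charles Dickens"),
  year   = c(1813, 1859),
  text   = c("It is a truth universally acknowledged ...",
             "It was the best of times, it was the worst of times ...")
)
corpus
```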

Transformation

Examples of typical transformations include:

  • Tagging segments of text for part of speech (nouns, verbs, adjectives, etc.) or entity recognition (person, place, company, etc.)
  • Standard text processing - we want to remove extraneous information from the text and standardize it into a uniform format. This typically involves:
    • Converting to lower case
    • Removing punctuation
    • Removing numbers
    • Removing stopwords - common parts of speech that are not informative such as a, an, be, of, etc.
    • Removing domain-specific stopwords
    • Stemming - reduce words to their word stem
      • “Fishing”, “fished”, and “fisher” -> “fish”
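These transformations can be sketched with tidytext (a sketch assuming the SnowballC package is installed for stemming; the example sentence is invented):

```r
library(dplyr)
library(tidytext)

doc <- tibble::tibble(
  doc_id = 1,
  text = "The fisher fished; he was Fishing for 3 hours."
)

tidy_doc <- doc %>%
  unnest_tokens(word, text) %>%             # converts to lower case, strips punctuation
  filter(!grepl("[0-9]", word)) %>%         # remove numbers
  anti_join(stop_words, by = "word") %>%    # remove stopwords ("the", "he", "was", "for")
  mutate(stem = SnowballC::wordStem(word))  # e.g. "fished" and "fishing" both stem to "fish"
tidy_doc
```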

Extract features

Feature extraction involves converting the text string into some sort of quantifiable measure. The most common approach is the bag-of-words model, whereby each document is represented as a vector which counts the frequency of each term’s appearance in the document. Combining the vectors for all the documents creates a document-term matrix:

  • Each row is a document
  • Each column is a term
  • Each cell represents the frequency of the term appearing in the document

However, the bag-of-words model ignores context. You could randomly scramble the order of terms appearing in the document and still get the same document-term matrix.
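Building such a matrix from tidy counts takes only a few lines (a sketch with two invented toy documents, using spread() as elsewhere in these notes):

```r
library(dplyr)
library(tidyr)
library(tidytext)

docs <- tibble::tibble(
  doc  = c(1, 2),
  text = c("the cat sat on the mat",
           "the dog sat on the log")
)

dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%          # term frequency within each document
  spread(word, n, fill = 0)     # rows = documents, columns = terms
dtm                             # e.g. "the" appears twice in each document
```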

Perform analysis

At this point you now have data assembled and ready for analysis. There are several approaches you may take when analyzing text depending on your research question. Basic approaches include:

  • Word frequency - counting the frequency of words in the text
  • Collocation - words commonly appearing near each other
  • Dictionary tagging - locating a specific set of words in the texts

More advanced methods include document classification, or assigning documents to different categories. This can be supervised (the potential categories are defined in advance of the modeling) or unsupervised (the potential categories are unknown prior to analysis). You might also conduct corpora comparison, or comparing the content of different groups of text. This is the approach used in plagiarism-detection software such as Turnitin. Finally, you may attempt to detect clusters of document features, known as topic modeling.

Descriptive text visualization

Wordclouds

So far we’ve used basic plots from ggplot2 to visualize our text data. However, we could also use a word cloud (also known as a tag cloud), which visually represents text data by weighting the importance of each word, typically based on its frequency in the document. We can use the wordcloud package in R to generate these plots from our tidied text data.

To draw the wordcloud, we need the data in tidy text format, with one row per token. For example, here is a wordcloud of a set of tweets related to #rstats:

library(twitteR)
# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_api_key"),
                    getOption("twitter_api_token"))
## [1] "Using browser based authentication"
library(wordcloud)

# get tweets
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"   # custom regular expression to tokenize tweets

rstats <- searchTwitter('#rstats', n = 3200) %>%
  twListToDF %>%
  as_tibble

# tokenize
rstats_token <- rstats %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

# plot
rstats_token %>%
  count(word) %>%
  filter(word != "#rstats") %>%
  with(wordcloud(word, n, max.words = 100))

Or tweets by Pope Francis:

# get tweets
pope <- userTimeline("Pontifex", n = 3200) %>%
  twListToDF %>%
  as_tibble

# tokenize
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"   # custom regular expression to tokenize tweets

pope_token <- pope %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

# plot
pope_token %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

We can even use wordclouds to compare words or tokens through the comparison.cloud() function. For instance, how do the tweets by Donald Trump compare to those by Pope Francis? To make this work, we first need to convert our tidy data frame into a matrix using the acast() function from reshape2, then pass that to comparison.cloud().

library(reshape2)

# get fresh trump tweets
trump <- userTimeline("realDonaldTrump", n = 3200) %>%
  twListToDF %>%
  as_tibble

# tokenize
trump_token <- trump %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

bind_rows(Trump = trump_token, Pope = pope_token, .id = "person") %>%
  count(word, person) %>%
  acast(word ~ person, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 100, colors = c("blue", "red"))

The size of a word’s text is proportional to its frequency within its category (i.e. proportion of all Trump tweets or all Pope tweets). We can use this visualization to see the most frequent words/hashtags from President Trump and Pope Francis, but the sizes of the words are not comparable across categories.

N-gram viewers

An n-gram is a contiguous sequence of \(n\) items from a given sequence of text or speech.

  • n-gram of size 1 = unigram
  • n-gram of size 2 = bigram
  • n-gram of size 3 = trigram
  • n-gram of size 4 = four-gram, etc.

This starts to incorporate context into our visualization. Rather than assuming all words/tokens are unique and independent of one another, n-grams of size 2 and up join together pairs or combinations of words in order to identify their frequency within a document.
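tidytext can extract n-grams directly via the token argument to unnest_tokens(). A quick sketch:

```r
library(dplyr)
library(tidytext)

bigrams <- tibble::tibble(text = "to be or not to be") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams
# bigrams: "to be", "be or", "or not", "not to", "to be"
```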

Geospatial visualization with text

Network analysis with text

Sentiment analysis

Sentiment analysis uses text analysis to estimate the attitude of a speaker or writer with respect to some topic or the overall polarity of the document. For example, the sentence

I am happy

contains words and language typically associated with positive feelings and emotions. Therefore if someone tweeted “I am happy”, we could make an educated guess that the person is expressing positive feelings.
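For instance, tokenizing the sentence and joining it against a positive/negative lexicon flags the positive language (a sketch; recent versions of tidytext may prompt you to download the lexicon via the textdata package):

```r
library(dplyr)
library(tidytext)

tibble::tibble(text = "I am happy") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")
# only "happy" matches the lexicon, and it is tagged "positive"
```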

Obviously it would be difficult for us to create a complete dictionary that classifies words based on their emotional affect; fortunately other scholars have already done this for us. Some simply classify words and terms as positive or negative:

get_sentiments("bing")
## # A tibble: 6,788 x 2
##           word sentiment
##          <chr>     <chr>
##  1     2-faced  negative
##  2     2-faces  negative
##  3          a+  positive
##  4    abnormal  negative
##  5     abolish  negative
##  6  abominable  negative
##  7  abominably  negative
##  8   abominate  negative
##  9 abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

Others rate them on a numeric scale:

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##          word score
##         <chr> <int>
##  1    abandon    -2
##  2  abandoned    -2
##  3   abandons    -2
##  4   abducted    -2
##  5  abduction    -2
##  6 abductions    -2
##  7      abhor    -3
##  8   abhorred    -3
##  9  abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows

Still others rate words based on specific sentiments:

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
get_sentiments("nrc") %>%
  count(sentiment)
## # A tibble: 10 x 2
##       sentiment     n
##           <chr> <int>
##  1        anger  1247
##  2 anticipation   839
##  3      disgust  1058
##  4         fear  1476
##  5          joy   689
##  6     negative  3324
##  7     positive  2312
##  8      sadness  1191
##  9     surprise   534
## 10        trust  1231

In order to assess the document or speaker’s overall sentiment, you simply count up the number of words associated with each sentiment. For instance, how positive or negative are Jane Austen’s novels? We can determine this by counting up the number of positive and negative words in each chapter, like so:

library(janeaustenr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
        geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
        facet_wrap(~book, ncol = 2, scales = "free_x")

Ignoring the specific code, this is a relatively simple operation. Once you have the text converted into a format suitable for analysis, tabulating and counting term frequency is not a complicated operation.

Exploring content of Donald Trump’s Twitter timeline

If you want to know what President Donald Trump personally tweets from his account versus what his handlers post, it looks like we might have a way of detecting the difference: tweets from an iPhone are from his staff, while tweets from an Android are from him. Can we quantify this behavior or use text analysis to lend evidence to this argument? Yes.

Obtaining documents

library(twitteR)
# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_api_key"),
                    getOption("twitter_api_token"))
## [1] "Using browser based authentication"
# We can request only 3,200 tweets at a time; the API will
# often return fewer
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
trump_tweets_df <- trump_tweets %>%
  map_df(as.data.frame) %>%
  tbl_df()
# if you want to follow along without setting up Twitter authentication,
# just use this dataset:
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
str(trump_tweets_df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1512 obs. of  16 variables:
##  $ text         : chr  "My economic policy speech will be carried live at 12:15 P.M. Enjoy!" "Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8" "#ICYMI: \"Will Media Apologize to Trump?\" https://t.co/ia7rKBmioA" "Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton "| __truncated__ ...
##  $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ favoriteCount: num  9214 6981 15724 19837 34051 ...
##  $ replyToSN    : chr  NA NA NA NA ...
##  $ created      : POSIXct, format: "2016-08-08 15:20:44" "2016-08-08 13:28:20" ...
##  $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSID   : logi  NA NA NA NA NA NA ...
##  $ id           : chr  "762669882571980801" "762641595439190016" "762439658911338496" "762425371874557952" ...
##  $ replyToUID   : chr  NA NA NA NA ...
##  $ statusSource : chr  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
##  $ screenName   : chr  "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" ...
##  $ retweetCount : num  3107 2390 6691 6402 11717 ...
##  $ isRetweet    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longitude    : chr  NA NA NA NA ...
##  $ latitude     : chr  NA NA NA NA ...

Clean up the data

Let’s next clean up the data frame by selecting only the relevant columns, extracting from statusSource the name of the application used to generate the tweet, and filtering for only tweets from an iPhone or an Android phone. The extract() function uses a regular expression to extract the app name.

tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))

tweets %>%
  head() %>%
  knitr::kable(caption = "Example of Donald Trump tweets")
Example of Donald Trump tweets
id source text created
762669882571980801 Android My economic policy speech will be carried live at 12:15 P.M. Enjoy! 2016-08-08 15:20:44
762641595439190016 iPhone Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 2016-08-08 13:28:20
762439658911338496 iPhone #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA 2016-08-08 00:05:54
762425371874557952 Android Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! 2016-08-07 23:09:08
762400869858115588 Android The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! 2016-08-07 21:31:46
762284533341417472 Android I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! 2016-08-07 13:49:29

Comparison of words

What can we say about the difference in the content? We can use the tidytext package to analyze this.

We start by dividing the text into individual words using the unnest_tokens() function and removing some common stopwords. This is a common step in preparing text for analysis. Typically, tokens are single words from a document. However, they can also be bi-grams (pairs of words), tri-grams (three-word sequences), n-grams (\(n\)-length sequences of words), or in this case, individual words, hashtags, or references to other Twitter users. Because tweets are a special form of text (they can include words, urls, references to other users, hashtags, etc.) we need a custom regular expression to convert the text into tokens.

library(tidytext)

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"   # custom regular expression to tokenize tweets

# function to neatly print the first 10 rows using kable
print_neat <- function(df){
  df %>%
    head() %>%
    knitr::kable()
}

# tweets data frame
tweets %>%
  print_neat()
id source text created
762669882571980801 Android My economic policy speech will be carried live at 12:15 P.M. Enjoy! 2016-08-08 15:20:44
762641595439190016 iPhone Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 2016-08-08 13:28:20
762439658911338496 iPhone #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA 2016-08-08 00:05:54
762425371874557952 Android Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! 2016-08-07 23:09:08
762400869858115588 Android The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! 2016-08-07 21:31:46
762284533341417472 Android I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! 2016-08-07 13:49:29
# remove manual retweets
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  print_neat()
id source text created
762669882571980801 Android My economic policy speech will be carried live at 12:15 P.M. Enjoy! 2016-08-08 15:20:44
762641595439190016 iPhone Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 2016-08-08 13:28:20
762439658911338496 iPhone #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA 2016-08-08 00:05:54
762425371874557952 Android Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! 2016-08-07 23:09:08
762400869858115588 Android The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! 2016-08-07 21:31:46
762284533341417472 Android I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! 2016-08-07 13:49:29
# remove urls
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  print_neat()
id source text created
762669882571980801 Android My economic policy speech will be carried live at 12:15 P.M. Enjoy! 2016-08-08 15:20:44
762641595439190016 iPhone Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: 2016-08-08 13:28:20
762439658911338496 iPhone #ICYMI: “Will Media Apologize to Trump?” 2016-08-08 00:05:54
762425371874557952 Android Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! 2016-08-07 23:09:08
762400869858115588 Android The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! 2016-08-07 21:31:46
762284533341417472 Android I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! 2016-08-07 13:49:29
# unnest into tokens - tidytext format
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  print_neat()
id source created word
676494179216805888 iPhone 2015-12-14 20:09:15 record
676494179216805888 iPhone 2015-12-14 20:09:15 of
676494179216805888 iPhone 2015-12-14 20:09:15 health
676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
676509769562251264 iPhone 2015-12-14 21:11:12 another
# remove stop words
tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  print_neat()
id source created word
676494179216805888 iPhone 2015-12-14 20:09:15 record
676494179216805888 iPhone 2015-12-14 20:09:15 health
676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
676509769562251264 iPhone 2015-12-14 21:11:12 accolade
676509769562251264 iPhone 2015-12-14 21:11:12 @trumpgolf
# store for future use
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

What were the most common words in Trump’s tweets overall?
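The tabulation itself is short (a sketch using the tweet_words frame built above):

```r
tweet_words %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip() +
  labs(x = NULL,
       y = "Number of occurrences")
```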

Yeah, sounds about right.

Assessing word and document frequency

One measure of how important a word may be is its term frequency (tf), how frequently a word occurs within a document. The problem with this approach is that some words occur many times in a document, yet are probably not important (e.g. “the”, “is”, “of”). Instead, we want a way of downweighting words that are common across all documents, and upweighting words that are frequent within a small set of documents.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity, not a theoretically proven method. The inverse document frequency for any given term is defined as:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

To calculate tf-idf for this set of documents, we will pool all the tweets from the iPhone and the Android together and treat them as two documents. Then we can calculate the frequency of terms in each group, and standardize that relative to the term’s frequency across the entire corpus.
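As a quick sanity check of the formula: with two documents, a term appearing in both gets an idf of \(\ln{(2/2)} = 0\), while a term appearing in only one gets \(\ln{(2/1)} \approx 0.693\):

```r
n_documents <- 2

log(n_documents / 2)  # term in both documents:    idf = 0
log(n_documents / 1)  # term in just one document: idf = 0.693
```

These are exactly the two idf values that show up in the output below.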

tweet_words_count <- tweet_words %>%
  count(source, word, sort = TRUE) %>%
  ungroup()
tweet_words_count
## # A tibble: 3,235 x 3
##     source                   word     n
##      <chr>                  <chr> <int>
##  1  iPhone             #trump2016   171
##  2 Android                hillary   124
##  3  iPhone #makeamericagreatagain    95
##  4 Android                crooked    93
##  5 Android                clinton    66
##  6 Android                 people    64
##  7  iPhone                hillary    52
##  8 Android                   cruz    50
##  9 Android                    bad    43
## 10  iPhone                america    43
## # ... with 3,225 more rows
total_words <- tweet_words_count %>%
  group_by(source) %>%
  summarize(total = sum(n))
total_words
## # A tibble: 2 x 2
##    source total
##     <chr> <int>
## 1 Android  4901
## 2  iPhone  3852
tweet_words_count <- left_join(tweet_words_count, total_words)
tweet_words_count
## # A tibble: 3,235 x 4
##     source                   word     n total
##      <chr>                  <chr> <int> <int>
##  1  iPhone             #trump2016   171  3852
##  2 Android                hillary   124  4901
##  3  iPhone #makeamericagreatagain    95  3852
##  4 Android                crooked    93  4901
##  5 Android                clinton    66  4901
##  6 Android                 people    64  4901
##  7  iPhone                hillary    52  3852
##  8 Android                   cruz    50  4901
##  9 Android                    bad    43  4901
## 10  iPhone                america    43  3852
## # ... with 3,225 more rows
tweet_words_count <- tweet_words_count %>%
  bind_tf_idf(word, source, n)
tweet_words_count
## # A tibble: 3,235 x 7
##     source                   word     n total      tf   idf tf_idf
##      <chr>                  <chr> <int> <int>   <dbl> <dbl>  <dbl>
##  1  iPhone             #trump2016   171  3852 0.04439 0.000 0.0000
##  2 Android                hillary   124  4901 0.02530 0.000 0.0000
##  3  iPhone #makeamericagreatagain    95  3852 0.02466 0.693 0.0171
##  4 Android                crooked    93  4901 0.01898 0.000 0.0000
##  5 Android                clinton    66  4901 0.01347 0.000 0.0000
##  6 Android                 people    64  4901 0.01306 0.000 0.0000
##  7  iPhone                hillary    52  3852 0.01350 0.000 0.0000
##  8 Android                   cruz    50  4901 0.01020 0.000 0.0000
##  9 Android                    bad    43  4901 0.00877 0.000 0.0000
## 10  iPhone                america    43  3852 0.01116 0.000 0.0000
## # ... with 3,225 more rows

Which terms have a high tf-idf?

tweet_words_count %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 3,235 x 6
##     source                   word     n      tf   idf  tf_idf
##      <chr>                  <chr> <int>   <dbl> <dbl>   <dbl>
##  1  iPhone #makeamericagreatagain    95 0.02466 0.693 0.01709
##  2  iPhone                   join    42 0.01090 0.693 0.00756
##  3  iPhone          #americafirst    27 0.00701 0.693 0.00486
##  4  iPhone             #votetrump    23 0.00597 0.693 0.00414
##  5  iPhone             #imwithyou    20 0.00519 0.693 0.00360
##  6  iPhone        #crookedhillary    17 0.00441 0.693 0.00306
##  7  iPhone          #trumppence16    14 0.00363 0.693 0.00252
##  8  iPhone                    7pm    11 0.00286 0.693 0.00198
##  9  iPhone                  video    11 0.00286 0.693 0.00198
## 10 Android                  badly    13 0.00265 0.693 0.00184
## # ... with 3,225 more rows
tweet_important <- tweet_words_count %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

tweet_important %>%
  group_by(source) %>%
  slice(1:15) %>%
  ggplot(aes(word, tf_idf, fill = source)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  labs(title = "Highest tf-idf words in @realDonaldTrump",
       subtitle = "Top 15 for Android and iPhone",
       x = NULL, y = "tf-idf") +
  coord_flip()

  • Most hashtags come from the iPhone. Indeed, almost no tweets from Trump’s Android contained hashtags, with rare exceptions. (This is true only because we filtered out the quoted “retweets”, as Trump does sometimes quote tweets that contain hashtags.)

  • Words like “join”, and times like “7pm”, also came only from the iPhone. The iPhone is clearly responsible for event announcements like this one (“Join me in Houston, Texas tomorrow night at 7pm!”)

  • A lot of “emotionally charged” words, like “badly” and “dumb”, were overwhelmingly more common on Android. This supports the original hypothesis that this is the “angrier” or more hyperbolic account.

Sentiment analysis

Since we’ve observed a difference in sentiment between the Android and iPhone tweets, let’s try quantifying it. We’ll work with the NRC Word-Emotion Association lexicon, available from the tidytext package, which associates words with 10 sentiments: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows

To measure the sentiment of the Android and iPhone tweets, we can count the number of words in each category:

sources <- tweet_words %>%
  group_by(source) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, source, total_words)
sources
## # A tibble: 1,172 x 3
##                    id  source total_words
##                 <chr>   <chr>       <int>
##  1 676494179216805888  iPhone        3852
##  2 676509769562251264  iPhone        3852
##  3 680496083072593920 Android        4901
##  4 680503951440121856 Android        4901
##  5 680505672476262400 Android        4901
##  6 680734915718176768 Android        4901
##  7 682764544402440192  iPhone        3852
##  8 682792967736848385  iPhone        3852
##  9 682805320217980929  iPhone        3852
## 10 685490467329425408 Android        4901
## # ... with 1,162 more rows
by_source_sentiment <- tweet_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(sources) %>%
  group_by(source, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()

head(by_source_sentiment)
## # A tibble: 6 x 4
##    source    sentiment total_words words
##     <chr>        <chr>       <int> <dbl>
## 1 Android        anger        4901   321
## 2 Android anticipation        4901   256
## 3 Android      disgust        4901   207
## 4 Android         fear        4901   268
## 5 Android          joy        4901   199
## 6 Android     negative        4901   560

(For example, we see that 321 of the 4901 words in the Android tweets were associated with “anger”). We then want to measure how much more likely the Android account is to use an emotionally-charged term relative to the iPhone account. Since this is count data, we can use a Poisson test to measure the difference:

# function to calculate the poisson.test for a given sentiment
poisson_test <- function(df){
  poisson.test(df$words, df$total_words)
}

# use the nest() and map() functions to apply poisson_test to each sentiment and 
# extract results using broom::tidy()
sentiment_differences <- by_source_sentiment %>%
  group_by(sentiment) %>%
  nest() %>%
  mutate(poisson = map(data, poisson_test),
         poisson_tidy = map(poisson, tidy)) %>%
  unnest(poisson_tidy, .drop = TRUE)
sentiment_differences
## # A tibble: 10 x 9
##       sentiment estimate statistic  p.value parameter conf.low conf.high
##           <chr>    <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
##  1        anger     1.49       321 2.19e-05       274    1.235      1.81
##  2 anticipation     1.17       256 1.19e-01       240    0.960      1.43
##  3      disgust     1.68       207 1.78e-05       170    1.312      2.16
##  4         fear     1.56       268 1.89e-05       226    1.264      1.93
##  5          joy     1.00       199 1.00e+00       199    0.809      1.24
##  6     negative     1.69       560 7.09e-13       459    1.459      1.97
##  7     positive     1.06       555 3.82e-01       541    0.930      1.21
##  8      sadness     1.62       303 1.15e-06       252    1.326      1.99
##  9     surprise     1.17       159 2.17e-01       149    0.908      1.51
## 10        trust     1.13       369 1.47e-01       351    0.960      1.33
## # ... with 2 more variables: method <fctr>, alternative <fctr>

And we can visualize it with a 95% confidence interval:

sentiment_differences %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, estimate)) %>%
  mutate(estimate = estimate - 1,
         conf.low = conf.low - 1,
         conf.high = conf.high - 1) %>%
  ggplot(aes(estimate, sentiment)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  scale_x_continuous(labels = percent_format()) +
  labs(x = "% increase in Android relative to iPhone",
       y = "Sentiment")

Thus, Trump’s Android account uses about 40-80% more words related to disgust, sadness, fear, anger, and other “negative” sentiments than the iPhone account does. (The positive emotions weren’t different to a statistically significant extent).

We’re especially interested in which words drove this difference in sentiment. Let’s consider the words with the largest changes within each category:

tweet_important %>%
  inner_join(nrc, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(sentiment = reorder(sentiment, -tf_idf),
         word = reorder(word, -tf_idf)) %>%
  group_by(sentiment) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = source)) +
  facet_wrap(~ sentiment, scales = "free", nrow = 4) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "",
       y = "tf-idf") +
  scale_fill_manual(name = "", labels = c("Android", "iPhone"),
                    values = c("red", "lightblue"))

This confirms that lots of words annotated as negative sentiments are more common in Trump’s Android tweets than the campaign’s iPhone tweets. It’s no wonder Trump’s staff took away his tweeting privileges for the remainder of the campaign.

Latent semantic analysis

Text documents can be analyzed computationally under the bag-of-words approach.1 Documents are represented as vectors, and each variable counts the frequency with which a word appears in a given document. While we throw away information such as word order, we can represent the information in a mathematical fashion using a matrix. Each row represents a single document, and each column is a different word:

 a abandoned abc ability able about above abroad absorbed absorbing abstract
43         0   0       0    0    10     0      0        0         0        1
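As a concrete illustration, a matrix like this can be built with tidytext on a toy two-document corpus (the texts here are made up for this sketch):

```r
library(tidyverse)
library(tidytext)

# toy corpus (hypothetical)
docs <- tibble(
  document = c(1, 2),
  text = c("the cat sat on the mat",
           "the dog chased the cat")
)

# one row per (document, word) pair with its count
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word)

# spread into a document-term matrix: one row per document,
# one column per word, 0 where a word does not appear
word_counts %>%
  spread(word, n, fill = 0)
```

tidytext also provides cast_dtm() to build a sparse DocumentTermMatrix directly, which matters once the vocabulary grows.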

These vectors can be very large depending on the dictionary, or the number of unique words in the dataset. These bag-of-words vectors have three important properties:

  1. They are sparse. Most entries in the matrix are zero.
  2. A small number of words appear frequently across all documents. These are typically uninformative stop words that tell us nothing about the document (e.g. “a”, “an”, “at”, “of”, “or”).
  3. The remaining words in the dataset are correlated with some words but not others. Words typically come together in related bunches.

Considering these three properties, we probably don’t need to keep all of the words. Instead, we could reduce the dimensionality of the data by projecting the larger dataset into a smaller feature space with fewer dimensions that summarize most of the variation in the data. Each dimension would represent a set of correlated words.

In a textual context, this process is known as latent semantic analysis. By identifying words that are closely related to one another, when searching for just one of the terms we can find documents that use not only that specific term but other similar ones. Think about how you search for information online. You normally identify one or more keywords, and search for pages that are related to those words. But search engines use techniques such as LSA to retrieve results not only for pages that use your exact word(s), but also pages that use similar or related words.
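The intuition behind this projection can be seen on a tiny made-up document-term matrix: three documents use "fishing" words and three use "music" words. Running PCA (the workhorse behind the LSA below) collapses each correlated bunch of words onto a single dimension:

```r
# toy document-term matrix (hypothetical counts): rows are documents,
# columns are words; docs 1-3 are about fishing, docs 4-6 about music
dtm <- rbind(
  c(3, 2, 0, 0),
  c(2, 3, 0, 0),
  c(4, 1, 0, 1),
  c(0, 0, 3, 2),
  c(0, 1, 2, 3),
  c(1, 0, 4, 2)
)
colnames(dtm) <- c("fish", "boat", "violin", "concerto")

dtm.pca <- prcomp(dtm)

# loadings on the first component: the fishing words share one sign and
# the music words the other, so PC1 acts as a single "fishing vs. music"
# semantic dimension
round(dtm.pca$rotation[, "PC1"], 2)
```

A query containing only "fish" would score documents along this dimension, and so would also surface documents that mention "boat" but never "fish" — which is precisely the retrieval behavior described above.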

Interpretation: NYTimes

# get NYTimes data
load("data/pca-examples.Rdata")

Let’s look at an application of LSA. nyt.frame contains a document-term matrix of a random sample of stories from the New York Times: 57 stories are about art, and 45 are about music. The first column identifies the topic of the article, and each remaining cell contains a frequency count of the number of times each word appeared in that article.2 The resulting data frame contains 102 rows and 4432 columns.

Some examples of words appearing in these articles:

colnames(nyt.frame)[sample(ncol(nyt.frame),30)]
##  [1] "penchant"  "brought"   "structure" "willing"   "yielding" 
##  [6] "bare"      "school"    "halls"     "challenge" "step"     
## [11] "largest"   "lovers"    "intense"   "borders"   "mall"     
## [16] "classic"   "conducted" "mirrors"   "hole"      "location" 
## [21] "desperate" "published" "head"      "paints"    "another"  
## [26] "starts"    "familiar"  "window"    "thats"     "broker"

We can estimate the LSA using the standard PCA procedure:

# Omit the first column of class labels
nyt.pca <- prcomp(nyt.frame[,-1])

# Extract the actual component directions/weights for ease of reference
nyt.latent.sem <- nyt.pca$rotation

# convert to data frame
nyt.latent.sem <- nyt.latent.sem %>%
  as_tibble %>%
  mutate(word = names(nyt.latent.sem[,1])) %>%
  select(word, everything())

Let’s extract the biggest components for the first principal component:

nyt.latent.sem %>%
  select(word, PC1) %>%
  arrange(PC1) %>%
  slice(c(1:10, (n() - 9):n())) %>%
  mutate(pos = ifelse(PC1 > 0, TRUE, FALSE),
         word = fct_reorder(word, PC1)) %>%
  ggplot(aes(word, PC1, fill = pos)) +
  geom_col() +
  labs(title = "LSA analysis of NYTimes articles",
       x = NULL,
       y = "PC1 scores") +
  coord_flip() +
  theme(legend.position = "none")

These are the 10 words with the largest positive and negative loadings on the first principal component. The words with positive loadings seem associated with music, whereas the words with negative loadings are more strongly associated with art.

nyt.latent.sem %>%
  select(word, PC2) %>%
  arrange(PC2) %>%
  slice(c(1:10, (n() - 9):n())) %>%
  mutate(pos = ifelse(PC2 > 0, TRUE, FALSE),
         word = fct_reorder(word, PC2)) %>%
  ggplot(aes(word, PC2, fill = pos)) +
  geom_col() +
  labs(title = "LSA analysis of NYTimes articles",
       x = NULL,
       y = "PC2 scores") +
  coord_flip() +
  theme(legend.position = "none")

Here the positive words are about art, but more focused on acquiring and trading (“donations”, “tax”). We could perform a similar analysis on each of the remaining principal components, but if the point of LSA/PCA is to reduce the dimensionality of the data, let’s just focus on the first two for now.

biplot(nyt.pca, scale = 0, cex = .6)

cbind(type = nyt.frame$class.labels, as_tibble(nyt.pca$x[,1:2])) %>%
  mutate(type = factor(type, levels = c("art", "music"),
                       labels = c("A", "M"))) %>%
  ggplot(aes(PC1, PC2, label = type, color = type)) +
  geom_text() +
  labs(title = "") +
  theme(legend.position = "none")

The biplot looks a bit ridiculous because there are 4432 variables to map onto the principal components. Only a few are interpretable. If we instead just consider the articles themselves, even after throwing away the vast majority of information in the original data set the first two principal components still strongly distinguish the two types of articles. If we wanted to use PCA to reduce the dimensionality of the data and predict an article’s topic using a method such as SVM, we could probably generate a pretty good model using just the first two dimensions of the PCA rather than all the individual variables (words).

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-05-22                  
## 
##  package      * version    date       source                            
##  assertthat     0.2.0      2017-04-11 cran (@0.2.0)                     
##  backports      1.0.5      2017-01-18 CRAN (R 3.3.2)                    
##  base         * 3.3.3      2017-03-07 local                             
##  bit            1.1-12     2014-04-09 CRAN (R 3.3.0)                    
##  bit64          0.9-7      2017-05-08 CRAN (R 3.3.2)                    
##  broom        * 0.4.2      2017-02-13 CRAN (R 3.3.2)                    
##  cellranger     1.1.0      2016-07-27 CRAN (R 3.3.0)                    
##  colorspace     1.3-2      2016-12-14 CRAN (R 3.3.2)                    
##  curl           2.6        2017-04-27 CRAN (R 3.3.2)                    
##  datasets     * 3.3.3      2017-03-07 local                             
##  DBI            0.6-1      2017-04-01 CRAN (R 3.3.2)                    
##  devtools       1.13.0     2017-05-08 CRAN (R 3.3.2)                    
##  digest         0.6.12     2017-01-27 CRAN (R 3.3.2)                    
##  dplyr        * 0.5.0      2016-06-24 CRAN (R 3.3.0)                    
##  evaluate       0.10       2016-10-11 CRAN (R 3.3.0)                    
##  forcats      * 0.2.0      2017-01-23 CRAN (R 3.3.2)                    
##  foreign        0.8-68     2017-04-24 CRAN (R 3.3.2)                    
##  ggplot2      * 2.2.1.9000 2017-05-12 Github (tidyverse/ggplot2@f4398b6)
##  graphics     * 3.3.3      2017-03-07 local                             
##  grDevices    * 3.3.3      2017-03-07 local                             
##  grid           3.3.3      2017-03-07 local                             
##  gtable         0.2.0      2016-02-26 CRAN (R 3.3.0)                    
##  haven          1.0.0      2016-09-23 cran (@1.0.0)                     
##  hms            0.3        2016-11-22 CRAN (R 3.3.2)                    
##  htmltools      0.3.6      2017-04-28 cran (@0.3.6)                     
##  httr           1.2.1      2016-07-03 CRAN (R 3.3.0)                    
##  janeaustenr    0.1.4      2016-10-26 CRAN (R 3.3.0)                    
##  jsonlite       1.4        2017-04-08 cran (@1.4)                       
##  knitr        * 1.15.1     2016-11-22 cran (@1.15.1)                    
##  lattice        0.20-35    2017-03-25 CRAN (R 3.3.2)                    
##  lazyeval       0.2.0      2016-06-12 CRAN (R 3.3.0)                    
##  lubridate      1.6.0      2016-09-13 CRAN (R 3.3.0)                    
##  magrittr       1.5        2014-11-22 CRAN (R 3.3.0)                    
##  Matrix         1.2-10     2017-04-28 CRAN (R 3.3.2)                    
##  memoise        1.1.0      2017-04-21 CRAN (R 3.3.2)                    
##  methods      * 3.3.3      2017-03-07 local                             
##  mnormt         1.5-5      2016-10-15 CRAN (R 3.3.0)                    
##  modelr       * 0.1.0      2016-08-31 CRAN (R 3.3.0)                    
##  munsell        0.4.3      2016-02-13 CRAN (R 3.3.0)                    
##  nlme           3.1-131    2017-02-06 CRAN (R 3.3.3)                    
##  openssl        0.9.6      2016-12-31 CRAN (R 3.3.2)                    
##  parallel       3.3.3      2017-03-07 local                             
##  plyr           1.8.4      2016-06-08 CRAN (R 3.3.0)                    
##  psych          1.7.5      2017-05-03 CRAN (R 3.3.3)                    
##  purrr        * 0.2.2.2    2017-05-11 CRAN (R 3.3.3)                    
##  R6             2.2.1      2017-05-10 CRAN (R 3.3.2)                    
##  RColorBrewer * 1.1-2      2014-12-07 CRAN (R 3.3.0)                    
##  Rcpp           0.12.10    2017-03-19 cran (@0.12.10)                   
##  readr        * 1.1.0      2017-03-22 cran (@1.1.0)                     
##  readxl         1.0.0      2017-04-18 CRAN (R 3.3.2)                    
##  reshape2       1.4.2      2016-10-22 CRAN (R 3.3.0)                    
##  rjson          0.2.15     2014-11-03 cran (@0.2.15)                    
##  rlang          0.1.9000   2017-05-12 Github (hadley/rlang@c17568e)     
##  rmarkdown      1.5        2017-04-26 CRAN (R 3.3.2)                    
##  rprojroot      1.2        2017-01-16 CRAN (R 3.3.2)                    
##  rvest          0.3.2      2016-06-17 CRAN (R 3.3.0)                    
##  scales       * 0.4.1      2016-11-09 CRAN (R 3.3.1)                    
##  slam           0.1-40     2016-12-01 CRAN (R 3.3.2)                    
##  SnowballC      0.5.1      2014-08-09 cran (@0.5.1)                     
##  stats        * 3.3.3      2017-03-07 local                             
##  stringi        1.1.5      2017-04-07 CRAN (R 3.3.2)                    
##  stringr      * 1.2.0      2017-02-18 CRAN (R 3.3.2)                    
##  tibble       * 1.3.0.9002 2017-05-12 Github (tidyverse/tibble@9103a30) 
##  tidyr        * 0.6.2      2017-05-04 CRAN (R 3.3.2)                    
##  tidytext     * 0.1.2      2016-10-28 CRAN (R 3.3.0)                    
##  tidyverse    * 1.1.1      2017-01-27 CRAN (R 3.3.2)                    
##  tokenizers     0.1.4      2016-08-29 CRAN (R 3.3.0)                    
##  tools          3.3.3      2017-03-07 local                             
##  twitteR      * 1.1.9      2015-07-29 CRAN (R 3.3.0)                    
##  utils        * 3.3.3      2017-03-07 local                             
##  withr          1.0.2      2016-06-20 CRAN (R 3.3.0)                    
##  wordcloud    * 2.5        2014-06-13 CRAN (R 3.3.0)                    
##  xml2           1.1.1      2017-01-24 CRAN (R 3.3.2)                    
##  yaml           2.1.14     2016-11-12 cran (@2.1.14)

  1. This section is drawn from section 18.3 in “Principal Component Analysis”.

  2. Actually it contains the term frequency-inverse document frequency (tf-idf), which downweights words that appear frequently across many documents. This is one method for guarding against biases caused by stop words.